One fateful Thursday evening, as I (Keva) was studying for a MATH410 quiz with my friend Frances, she interrupted my studying to ask me one question: what word do you use when someone drinks water without touching their mouth to the bottle?
My immediate response was "Waterfall," unaware of the entire debate that had spiraled in the CSA (Chinese Student Association) Discord server. Riots were being staged, murders were being planned, and hit lists were being drawn up. Why? It seems that a majority of the people from Maryland call this action "Airsip" rather than "Waterfall." In disbelief and shock, I put my MATH410 studying aside and chose to focus on this debate between Waterfall and Airsip. What made matters worse was when someone from New Jersey, just like me, said they called this action Airsip. This man then proceeded to call me a hillbilly and a redneck (for kicks and giggles) since I live in South Jersey.
Following these personal attacks, I messaged my friends back home in NJ and my brother about this debate, and the responses I got were surprising to say the least. Apparently, my brother, along with some of my other friends, has been calling this action "Fountain." Still, the majority of my friends said they call it "Waterfall." The discussion spread even further, to my roommates, my colleagues and even my teachers.
This debate inspired us to investigate whether different terms are used in different regions of the United States. I was reminded of a dialect quiz I took in Linguistics 200 which showed how certain words and phrases can reveal a person's regional dialect. We decided to create our own map to see if we could chart this difference; a choropleth map could even help us draw an isogloss.
Definition according to Wikipedia:
Isogloss: Geographic boundary of a certain linguistic feature, such as a pronunciation of a vowel, the meaning of a word, or the use of some morphological or syntactic feature.
If we can create a map that showcases any data about this topic, we could possibly draw a new isogloss for the Airsip vs. Waterfall debate. The purpose of this project is to analyze this debate while also looking for any differences associated with sex, age and height.
Before we get to the actual dataset, we would like to first show you the process of scraping poll data from social media. When we first started researching this project, we came across a number of different polls on Twitter and Reddit with hundreds of responses to this debate. We will take that data into account as a means to start off our project. Let's start by scraping Reddit poll data from this UMBC poll!
First, install praw by running pip install praw in your terminal.
Import the following statements. We will be using some of them later on for the rest of the project.
import pandas as pd
import numpy as np
import folium
import requests
import re
import matplotlib.pyplot as plt
import seaborn as sns
import praw
from datetime import datetime
from bs4 import BeautifulSoup
Next, we will need access to the Reddit API to be able to make use of the Reddit data. Create a user API token here https://www.reddit.com/prefs/apps using your reddit account.
Once you login, click create app and fill in the following fields:
Click create app. This should take you to a page that will have your client_id, secret and user_agent (name). Copy and paste these into the command below so that you can access Reddit.
reddit = praw.Reddit(client_id='YOUR_CLIENT_ID', client_secret='YOUR_CLIENT_SECRET', user_agent='CMSC320')
Let's take a look at a reddit post on the UMBC subreddit. I just want to show you guys the comments on this post. They are hilarious lol.
But more importantly, this is how we access a single thread on Reddit. We must provide a URL of the thread in the submission() function, which will return an instance of Submission that has all the data on the page packed into an object. Very useful later on!
UMBC = reddit.submission(url='https://www.reddit.com/r/UMBC/comments/l8b6mo/waterfall_or_air_sip/')
for top_level_comment in UMBC.comments:
    print(top_level_comment.body)
if you vote "air sip" your enrollment has now been rescinded
Those who said air sip, what the hell is wrong with you?
I think Air Sip is a Montgomery County thing. Waterfall is everywhere else.
you aren't sipping the air. you aren't sipping at all.
In Nigeria we actually call this 'skying'
if you ask me, I wouldn’t call it social distancing
OP, here's a poll too. https://www.strawpoll.me/12935217/r
Waterfall for life
Neither...
Highkey, I agree with the first two statements :) that's just me though.
But as you can see, if this submission did not have a poll, we could just as easily scrape through the comments and note down any instances of waterfall and airsip as our data. However, this would be tedious and annoying so we will just be looking at the poll.
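For the curious, that comment-counting approach might look something like the sketch below. `count_terms` is a hypothetical helper operating on plain comment text, not part of our actual pipeline; with praw you would pass it `[c.body for c in UMBC.comments]`.

```python
from collections import Counter

def count_terms(comment_bodies, terms=("waterfall", "airsip", "air sip")):
    """Count case-insensitive mentions of each term across comment texts."""
    counts = Counter()
    for body in comment_bodies:
        lowered = body.lower()
        for term in terms:
            counts[term] += lowered.count(term)
    return counts

# e.g. count_terms(["Waterfall for life", "air sip?? no, waterfall"])
# tallies 2 mentions of "waterfall" and 1 of "air sip"
```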
To access the poll, you can use the .poll_data attribute as follows.
poll_data = UMBC.poll_data
print(f"There are {poll_data.total_vote_count} votes total.")
print("The options are:")
for option in poll_data.options:
    print(f"{option} ({option.vote_count} votes)")
There are 323 votes total.
The options are:
Air sip (86 votes)
Waterfall (237 votes)
This gives us an idea of what data is presented in these polls. If you can successfully print what we have above, then we can just as easily create a dataframe from it!
Let's parse this data into string and int values, and then put it into a dataframe. For each option, we will add the option name to a list of names and its number of votes to a list of vote counts. Then we will label each list with a column name and finally create the DataFrame instance. Now we have easy access to our data!
name = ['Total']
num = [int(poll_data.total_vote_count)]
for option in poll_data.options:
    name.append(str(option))
    num.append(int(option.vote_count))
oop = { 'Word': name,
'Votes': num }
umbcdf = pd.DataFrame(oop)
umbcdf
| | Word | Votes |
|---|---|---|
| 0 | Total | 323 |
| 1 | Air sip | 86 |
| 2 | Waterfall | 237 |
I am going to repeat the process with one more Reddit post, shown below.
northeastern = reddit.submission(url='https://www.reddit.com/r/NEU/comments/hpehhd/whats_the_correct_word/')
for top_level_comment in northeastern.comments:
    print(top_level_comment.body)
Fuck
is a birdie that's golf???
Air sip is the noun, waterfall is the verb
It's a sky bro.
dolla
“Can I sky that”
neither it's sky
ITS AIRSIP
Its hockey sip
people from Boston say sky and I know we're in Boston or whatever but that's just wrong
Oh nah y’all saying waterfall??? How do I transfer schools
Where the hell did sky and birdie come from :,(
Truly horrifying
poll_data = northeastern.poll_data
name = ['Total']
num = [int(poll_data.total_vote_count)]
for option in poll_data.options:
    name.append(str(option))
    num.append(int(option.vote_count))
oop = { 'Word': name,
'Votes': num }
nedf = pd.DataFrame(oop)
nedf
| | Word | Votes |
|---|---|---|
| 0 | Total | 478 |
| 1 | Air sip | 120 |
| 2 | Birdie | 25 |
| 3 | Waterfall | 333 |
Unfortunately, we could not find a comprehensive dataset that encompasses this entire discussion on a country-wide scale, and Reddit polls alone do not give us the demographic detail we need. Therefore, we decided to conduct our own survey and gather as many responses as possible to continue with our investigation.
We created and posted a poll to a variety of state and college subreddits to collect more specific data on each person.
Note: The poll data we scraped earlier helped us design our Google Form. For example, we made sure to have an option for every answer we deemed valid, while still including an "Other" section in case we did not cover all possible answers. We did this because we wanted to make the survey as easy as possible to take, in order to motivate people to respond.
We asked surveyees questions about the states they have lived in, how long they have lived there, etc. so that we can find patterns in responses based on location, sex, etc.
Note: After about 200 responses, we changed a couple of our questions because the responses were overly complicated and hard to parse into meaningful data for analysis. Instead of asking to provide all of the other states they lived in, we asked to just provide the state that they lived in the longest.
Posting the poll on Reddit was time-consuming, and posts would often be removed by moderators or we would be banned from subreddits for posting "spam".
Let's start by reading in the CSV file that contains the responses from the poll into a Pandas DataFrame and printing out those results.
data = pd.read_csv("WaterfallvsAirsip.csv")
data
| | Timestamp | Birthday (MM/DD/YYYY)\n\ni.e. 01/14/2001 | Height (Answer in Inches Please)\n\ni.e. 60 inches = 5'0 | Sex | Coffee or Tea? | Early Bird or Night Owl | What is your current state of residence? Please answer in abbreviations.\n\ni.e. NJ, MD, PA, NY, CA, etc. | How long have you stayed in this state? Answer in years and can be in decimals! | Which county are you from? COUNTY not country | What state have you lived in the longest? Write down that state(s) in abbreviation. Otherwise, write N/A | If so, how long? Write N/A if you answered "N/A" | What do you call the action shown in the picture above? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11/22/2022 3:24:50 | 08/19/2002 | 61.0 | Female | Tea | Night Owl | NJ | 20.00 | Camden | NaN | NaN | Waterfall |
| 1 | 11/22/2022 3:33:28 | 01/10/2002 | 64.8 | Female | Tea | Night Owl | MD | 16.00 | Howard | OH | 4 | Waterfall |
| 2 | 11/22/2022 5:38:43 | 10/26/2002 | 60.0 | Female | Tea | Early Bird | IN | 0.17 | United States | IL | 19 | Waterfall |
| 3 | 11/22/2022 6:35:28 | 10/28/2000 | 73.0 | Male | Neither | Night Owl | MD | 22.00 | St mary’s county | NaN | NaN | Waterfall |
| 4 | 11/22/2022 7:52:55 | 10/23/2002 | 62.0 | Nonbinary | Coffee | Night Owl | MD | 12.50 | Harford | CA | 8 | Waterfall |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | NaN | 12/5/1999 | 60.0 | Female | Tea | Night Owl | NJ | 2.00 | NaN | WI | 10 | Waterfall |
| 996 | NaN | 9/1/2004 | 63.0 | Female | Tea | Night Owl | KS | 5.00 | NaN | CA | 5 | Birdie |
| 997 | NaN | 7/26/1991 | 60.0 | Female | Tea | Early Bird | WA | 6.00 | NaN | UT | 25 | Waterfall |
| 998 | NaN | 9/7/1988 | 71.0 | Male | Tea | Night Owl | TX | 19.00 | NaN | TX | 19 | Waterfall |
| 999 | NaN | 1/4/1990 | 71.0 | Male | Tea | Early Bird | ND | 2.00 | NaN | DE | 15 | Waterfall |
1000 rows × 12 columns
In order to analyze and make sense of our data, we need to clean it first.
We can't expect human responses to be exactly what we want in the format we want them, so it is our job to account for these human errors so that we can properly put the data to use.
We don't need to know when someone took the survey, so start by dropping the "Timestamp" column from the dataset.
data.drop(data.columns[[0]], axis=1, inplace=True)
As you can see, the column headers in our dataset are just the questions from the poll.
Change the column headers into simpler terms so they can easily be referenced in the code later on.
data.columns = ['Birthday', 'Height(in)', 'Sex', 'Coffee/Tea', 'Early Bird/Night Owl', 'Current State of Residence',
'Years in Current State', 'County', 'State of Longest Residence', 'Years in Longest State', 'Action']
Right now, the "Birthday" column is not a datetime object, so we can't use this column in any sort of analysis concerning the impact that a person's date of birth can have on their response to this highly debated question.
Change the "Birthday" column into a datetime object.
data.Birthday = pd.to_datetime(data.Birthday)
data
| | Birthday | Height(in) | Sex | Coffee/Tea | Early Bird/Night Owl | Current State of Residence | Years in Current State | County | State of Longest Residence | Years in Longest State | Action |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002-08-19 | 61.0 | Female | Tea | Night Owl | NJ | 20.00 | Camden | NaN | NaN | Waterfall |
| 1 | 2002-01-10 | 64.8 | Female | Tea | Night Owl | MD | 16.00 | Howard | OH | 4 | Waterfall |
| 2 | 2002-10-26 | 60.0 | Female | Tea | Early Bird | IN | 0.17 | United States | IL | 19 | Waterfall |
| 3 | 2000-10-28 | 73.0 | Male | Neither | Night Owl | MD | 22.00 | St mary’s county | NaN | NaN | Waterfall |
| 4 | 2002-10-23 | 62.0 | Nonbinary | Coffee | Night Owl | MD | 12.50 | Harford | CA | 8 | Waterfall |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 1999-12-05 | 60.0 | Female | Tea | Night Owl | NJ | 2.00 | NaN | WI | 10 | Waterfall |
| 996 | 2004-09-01 | 63.0 | Female | Tea | Night Owl | KS | 5.00 | NaN | CA | 5 | Birdie |
| 997 | 1991-07-26 | 60.0 | Female | Tea | Early Bird | WA | 6.00 | NaN | UT | 25 | Waterfall |
| 998 | 1988-09-07 | 71.0 | Male | Tea | Night Owl | TX | 19.00 | NaN | TX | 19 | Waterfall |
| 999 | 1990-01-04 | 71.0 | Male | Tea | Early Bird | ND | 2.00 | NaN | DE | 15 | Waterfall |
1000 rows × 11 columns
Some questions in the poll had the option to type "N/A" if the question did not apply to the surveyee or if they were unable to answer it. However, misspellings and variants are common and need to be replaced.
Replace all of the variants of "N/A" with a NaN value.
data = data.replace(['N/a', 'n/a', 'no', 'No', 'na', 'Na'], np.nan)
For any NaN values in the "State of Longest Residence" column we will replace it with their corresponding values in the "Current State of Residence" column.
This is because answering "N/A" to the "State of Longest Residence" question implies that the surveyee has not lived in another state other than their current state, which means that the current state they are living in is the state they have lived in the longest.
Replace NaN values in the "Years in Longest State" column with their corresponding values in the "Years in Current State" column like so.
data["State of Longest Residence"] = np.where(data["Years in Longest State"].isnull(), data["Current State of Residence"], data["State of Longest Residence"])
data["Years in Longest State"] = np.where(data["Years in Longest State"].isnull(), data["Years in Current State"], data["Years in Longest State"])
data
| | Birthday | Height(in) | Sex | Coffee/Tea | Early Bird/Night Owl | Current State of Residence | Years in Current State | County | State of Longest Residence | Years in Longest State | Action |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002-08-19 | 61.0 | Female | Tea | Night Owl | NJ | 20.00 | Camden | NJ | 20.0 | Waterfall |
| 1 | 2002-01-10 | 64.8 | Female | Tea | Night Owl | MD | 16.00 | Howard | OH | 4 | Waterfall |
| 2 | 2002-10-26 | 60.0 | Female | Tea | Early Bird | IN | 0.17 | United States | IL | 19 | Waterfall |
| 3 | 2000-10-28 | 73.0 | Male | Neither | Night Owl | MD | 22.00 | St mary’s county | MD | 22.0 | Waterfall |
| 4 | 2002-10-23 | 62.0 | Nonbinary | Coffee | Night Owl | MD | 12.50 | Harford | CA | 8 | Waterfall |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 1999-12-05 | 60.0 | Female | Tea | Night Owl | NJ | 2.00 | NaN | WI | 10 | Waterfall |
| 996 | 2004-09-01 | 63.0 | Female | Tea | Night Owl | KS | 5.00 | NaN | CA | 5 | Birdie |
| 997 | 1991-07-26 | 60.0 | Female | Tea | Early Bird | WA | 6.00 | NaN | UT | 25 | Waterfall |
| 998 | 1988-09-07 | 71.0 | Male | Tea | Night Owl | TX | 19.00 | NaN | TX | 19 | Waterfall |
| 999 | 1990-01-04 | 71.0 | Male | Tea | Early Bird | ND | 2.00 | NaN | DE | 15 | Waterfall |
1000 rows × 11 columns
It is inevitable that free-response polls attract inappropriate and duplicate responses. Let's start by removing a duplicate response at the request of the surveyee.
This surveyee realized that they had misread the "County" question as "Country", so they submitted their response again, with their mistake corrected.
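As an aside, exact duplicates (rows identical in every column) could be dropped in one pass with pandas' drop_duplicates; the resubmission here differs in a few cells, so it has to be fixed by hand instead. A minimal sketch on toy data:

```python
import pandas as pd

# Two identical rows plus one distinct row
df = pd.DataFrame({"Sex": ["Female", "Female", "Male"],
                   "Action": ["Waterfall", "Waterfall", "Airsip"]})

# drop_duplicates keeps the first occurrence of each fully identical row
deduped = df.drop_duplicates()  # leaves 2 rows
```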
print(data.iloc[85])
print()
print(data.iloc[86])
Birthday                      2003-08-20 00:00:00
Height(in)                    60.0
Sex                           Female
Coffee/Tea                    Tea
Early Bird/Night Owl          Early Bird
Current State of Residence    CA
Years in Current State        19.0
County                        USA
State of Longest Residence    CA
Years in Longest State        19.0
Action                        Waterfall
Name: 85, dtype: object

Birthday                      2003-08-20 00:00:00
Height(in)                    60.0
Sex                           Female
Coffee/Tea                    Tea
Early Bird/Night Owl          Early Bird
Current State of Residence    CA
Years in Current State        19.0
County                        Alameda
State of Longest Residence    N/A (Please delete my previous response!)
Years in Longest State        N/A (I misread county :3)
Action                        Waterfall
Name: 86, dtype: object
Modify her first response (row 85) and delete her second response.
data.iat[85,7] = 'Alameda'
data.iat[85,8] = 'CA'
data.iat[85,9] = 19.0
data.drop(86,axis=0,inplace=True)
data[83:87]
| | Birthday | Height(in) | Sex | Coffee/Tea | Early Bird/Night Owl | Current State of Residence | Years in Current State | County | State of Longest Residence | Years in Longest State | Action |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 83 | 2001-10-01 | 69.0 | Male | Tea | Early Bird | IL | 1.5 | Taiwan 🇹🇼 | IL | 1.5 | No term for this |
| 84 | 2000-02-05 | 59.0 | Female | Neither | Early Bird | MD | 22.0 | Montgomery | MD | 22.0 | Airsip |
| 85 | 2003-08-20 | 60.0 | Female | Tea | Early Bird | CA | 19.0 | Alameda | CA | 19.0 | Waterfall |
| 87 | 2003-04-06 | 60.0 | Female | Tea | Night Owl | CA | 19.0 | Alameda County | CA | 19.0 | Waterfall |
Some other rows we can drop are responses from surveyees who do not answer the questions appropriately.
We will remove these three since they do not give much relevant information. (No states were mentioned, height of the person is impossible, etc.)
print(data.loc[19])
print()
print(data.loc[33])
print()
print(data.loc[183])
Birthday                      1999-11-11 00:00:00
Height(in)                    1.0
Sex                           Female
Coffee/Tea                    Tea
Early Bird/Night Owl          Early Bird
Current State of Residence    RM
Years in Current State        7.0
County                        BTS
State of Longest Residence    Seokjin oppar
Years in Longest State        7
Action                        Holy juice drip 🤤
Name: 19, dtype: object

Birthday                      2001-01-07 00:00:00
Height(in)                    70.0
Sex                           Male
Coffee/Tea                    Tea
Early Bird/Night Owl          Night Owl
Current State of Residence    MD
Years in Current State        17.5
County                        Here
State of Longest Residence    My mom
Years in Longest State        9 months
Action                        Waterfall
Name: 33, dtype: object

Birthday                      1900-01-01 00:00:00
Height(in)                    10000.0
Sex                           NaN
Coffee/Tea                    Neither
Early Bird/Night Owl          Early Bird
Current State of Residence    Xx
Years in Current State        999.0
County                        Elephant
State of Longest Residence    Bb
Years in Longest State        20000001
Action                        Blowjob
Name: 183, dtype: object
Drop the three responses.
data.drop([19, 33, 183], axis=0, inplace=True)
One surveyee said they lived in a state for a "school year", so we will approximate that as 1 year.
data = data.replace('School year', 1)
We had originally asked the surveyees to provide all of the other states they have lived in, and how many years they lived in each state.
This is why, at first, some of the states and years in "State of Longest Residence" (Column 9) and "Years in Longest State" (Column 10) are separated by commas, where each year corresponds to a state.
For example: Column 9: MD, UT, AR | Column 10: 3, 1, 2 --> this means that they lived in MD for 3 years, UT for 1 year and AR for 2 years.
Let's narrow these values down to just the state they lived in the longest and its corresponding year.
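The core of that narrowing step can be sketched in isolation, using the hypothetical "MD, UT, AR" / "3, 1, 2" values from the example above: pair each state with its year and keep the pair with the largest year.

```python
states_part = "MD, UT, AR".split(",")
years_part = "3, 1, 2".split(",")

# Pair each state with its year, then keep the state lived in the longest
pairs = [(s.strip(), float(y)) for s, y in zip(states_part, years_part)]
longest_state, longest_years = max(pairs, key=lambda p: p[1])
# longest_state is 'MD', longest_years is 3.0
```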
To do this we will:
First, make a list of all U.S. states abbreviated and the full name of all states in the same order.
Note: "MASSACHUSETTS" will intentionally be misspelled as "MASSACHUSSETS" because one surveyee misspelled it.
states = [ 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']
states_full = [ 'ALASKA', 'ALABAMA', 'ARKANSAS', 'ARIZONA', 'CALIFORNIA', 'COLORADO', 'CONNECTICUT', 'DISTRICT OF COLUMBIA',
'DELAWARE', 'FLORIDA', 'GEORGIA', 'HAWAII', 'IOWA', 'IDAHO', 'ILLINOIS', 'INDIANA', 'KANSAS', 'KENTUCKY',
'LOUISIANA', 'MASSACHUSSETS', 'MARYLAND', 'MAINE', 'MICHIGAN', 'MINNESOTA', 'MISSOURI', 'MISSISSIPPI',
'MONTANA', 'NORTH CAROLINA', 'NORTH DAKOTA', 'NEBRASKA', 'NEW HAMPSHIRE', 'NEW JERSEY', 'NEW MEXICO', 'NEVADA',
'NEW YORK', 'OHIO', 'OKLAHOMA', 'OREGON', 'PENNSYLVANIA', 'RHODE ISLAND', 'SOUTH CAROLINA', 'SOUTH DAKOTA',
'TENNESSEE', 'TEXAS', 'UTAH', 'VIRGINIA', 'VERMONT', 'WASHINGTON', 'WISCONSIN', 'WEST VIRGINIA', 'WYOMING']
Now, begin the cleaning process by iterating through all rows in the dataset.
for idx, row in data.iterrows():
    # Split the comma-separated state and year values in the current row
    txt = str(row["State of Longest Residence"]).split(',')
    years = str(row["Years in Longest State"]).split(',')
    # Cleaned states and years go into the lists below:
    # list of state abbreviations for every response in "State of Longest Residence" in the current row
    sts = []
    # list of years for every response in "Years in Longest State" in the current row
    yrs = []
    # Iterate through each value (state/location) in txt.
    # enumerate() gives us the index i so we can locate the corresponding value in
    # the list of years (safer than txt.index(x), which breaks on duplicate entries)
    for i, x in enumerate(txt):
        # Account for any responses with the full state name instead of the abbreviation
        # Get rid of any leading and trailing whitespace using strip()
        if x.strip().upper() in states_full:
            # Use x's position in states_full to append the corresponding abbreviation to 'sts'
            sts.append(states[states_full.index(x.strip().upper())])
            # This regex extracts only the numbers from the provided years, since some
            # responses included the word "year" or "month" instead of just a number
            yr = re.findall(r"(?:\d+(?:\.\d*)?|\.\d+)", years[i].strip())[0]
            # Some responses give a number of months, so we convert those into years
            if "month" in years[i]:
                yr = float(yr) / 12.0
            yrs.append(yr)
        # Check if it is a state abbreviation
        elif x.strip().upper() in states:
            sts.append(x.strip().upper())
            yr = re.findall(r"(?:\d+(?:\.\d*)?|\.\d+)", years[i].strip())[0]
            if "month" in years[i]:
                yr = float(yr) / 12.0
            yrs.append(yr)
        # If it is not a state at all, do nothing:
        # anything that is not a state does not get added to either list
        else:
            continue
    # Replace each cell with the cleaner lists of states and years
    data.at[idx, "State of Longest Residence"] = sts
    data.at[idx, "Years in Longest State"] = yrs
data
| | Birthday | Height(in) | Sex | Coffee/Tea | Early Bird/Night Owl | Current State of Residence | Years in Current State | County | State of Longest Residence | Years in Longest State | Action |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002-08-19 | 61.0 | Female | Tea | Night Owl | NJ | 20.00 | Camden | [NJ] | [20.0] | Waterfall |
| 1 | 2002-01-10 | 64.8 | Female | Tea | Night Owl | MD | 16.00 | Howard | [OH] | [4] | Waterfall |
| 2 | 2002-10-26 | 60.0 | Female | Tea | Early Bird | IN | 0.17 | United States | [IL] | [19] | Waterfall |
| 3 | 2000-10-28 | 73.0 | Male | Neither | Night Owl | MD | 22.00 | St mary’s county | [MD] | [22.0] | Waterfall |
| 4 | 2002-10-23 | 62.0 | Nonbinary | Coffee | Night Owl | MD | 12.50 | Harford | [CA] | [8] | Waterfall |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 1999-12-05 | 60.0 | Female | Tea | Night Owl | NJ | 2.00 | NaN | [WI] | [10] | Waterfall |
| 996 | 2004-09-01 | 63.0 | Female | Tea | Night Owl | KS | 5.00 | NaN | [CA] | [5] | Birdie |
| 997 | 1991-07-26 | 60.0 | Female | Tea | Early Bird | WA | 6.00 | NaN | [UT] | [25] | Waterfall |
| 998 | 1988-09-07 | 71.0 | Male | Tea | Night Owl | TX | 19.00 | NaN | [TX] | [19] | Waterfall |
| 999 | 1990-01-04 | 71.0 | Male | Tea | Early Bird | ND | 2.00 | NaN | [DE] | [15] | Waterfall |
996 rows × 11 columns
Now that we have cleaned up those two columns, we can go through each row to see which state was lived in the longest for each response
for idx, row in data.iterrows():
    # Convert each list of years to floats so we can compare them
    lst = [float(x) for x in row["Years in Longest State"]]
    # If the list of years is empty, set both the year and state values to NaN
    if len(lst) == 0:
        data.at[idx, "Years in Longest State"] = np.nan
        data.at[idx, "State of Longest Residence"] = np.nan
    # If there is only one year in the list, its state is automatically the state most lived in
    elif len(lst) == 1:
        data.at[idx, "Years in Longest State"] = lst[0]
        data.at[idx, "State of Longest Residence"] = row["State of Longest Residence"][0]
    # If there is more than one year in the list, take the max year: it replaces the
    # current list in "Years in Longest State" and its corresponding state replaces
    # the current list in "State of Longest Residence"
    else:
        max_yr = max(lst)
        data.at[idx, "Years in Longest State"] = max_yr
        i = lst.index(max_yr)
        data.at[idx, "State of Longest Residence"] = row["State of Longest Residence"][i]
Now that you have cleaned the extraneous data, you can fill any resulting NaN values just as you did earlier: the current state and year replace a missing longest state and year.
data["State of Longest Residence"] = np.where(data["Years in Longest State"].isnull(), data["Current State of Residence"], data["State of Longest Residence"])
data["Years in Longest State"] = np.where(data["Years in Longest State"].isnull(), data["Years in Current State"], data["Years in Longest State"])
data.head(38)
| | Birthday | Height(in) | Sex | Coffee/Tea | Early Bird/Night Owl | Current State of Residence | Years in Current State | County | State of Longest Residence | Years in Longest State | Action |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002-08-19 | 61.0 | Female | Tea | Night Owl | NJ | 20.00 | Camden | NJ | 20.0 | Waterfall |
| 1 | 2002-01-10 | 64.8 | Female | Tea | Night Owl | MD | 16.00 | Howard | OH | 4.0 | Waterfall |
| 2 | 2002-10-26 | 60.0 | Female | Tea | Early Bird | IN | 0.17 | United States | IL | 19.0 | Waterfall |
| 3 | 2000-10-28 | 73.0 | Male | Neither | Night Owl | MD | 22.00 | St mary’s county | MD | 22.0 | Waterfall |
| 4 | 2002-10-23 | 62.0 | Nonbinary | Coffee | Night Owl | MD | 12.50 | Harford | CA | 8.0 | Waterfall |
| 5 | 2002-05-20 | 60.0 | Female | Tea | Early Bird | MD | 20.50 | St. Mary’s County | MD | 20.5 | Waterfall |
| 6 | 2002-05-03 | 66.0 | Female | Tea | Night Owl | MD | 20.50 | United States | MD | 20.5 | Waterfall |
| 7 | 2001-09-04 | 76.0 | Male | Tea | Night Owl | MD | 21.00 | USA | MD | 21.0 | Waterfall |
| 8 | 2004-08-03 | 65.0 | Female | Neither | Night Owl | MD | 5.00 | Howard | TX | 5.0 | Waterfall |
| 9 | 2002-04-18 | 62.0 | Female | Tea | Night Owl | MD | 7.40 | United States | KY | 6.0 | Waterfall |
| 10 | 2002-11-25 | 76.0 | Male | Tea | Night Owl | MD | 10.00 | USA | WI | 1.5 | Waterfall |
| 11 | 2008-07-14 | 61.0 | Female | Neither | Night Owl | MD | 5.00 | India | MD | 5.0 | Airsip |
| 12 | 1977-05-04 | 64.0 | Female | Coffee | Night Owl | MD | 17.00 | India | OH | 4.0 | Waterfall |
| 13 | 2001-11-30 | 68.0 | Male | Tea | Night Owl | NJ | 10.00 | US | NY | 2.0 | Airsip |
| 14 | 2006-12-26 | 62.0 | Female | Coffee | Night Owl | MD | 15.92 | Howard | MD | 15.92 | Waterfall |
| 15 | 2001-04-27 | 62.0 | Female | Coffee | Night Owl | MD | 18.00 | Peru | MD | 18.0 | Waterfall |
| 16 | 2001-04-02 | 61.0 | Female | Tea | Night Owl | MD | 21.00 | US | MD | 21.0 | Waterfall |
| 17 | 2002-11-08 | 69.0 | Female | Tea | Night Owl | MD | 19.00 | Montgomery | TX | 1.0 | Airsip |
| 18 | 2002-01-15 | 71.0 | Male | Coffee | Night Owl | MD | 20.90 | United States | MD | 20.9 | Airsip |
| 20 | 2002-02-13 | 67.0 | Female | Tea | Night Owl | MD | 19.20 | USA | MD | 19.2 | Airsip |
| 21 | 2002-04-01 | 66.0 | Female | Coffee | Night Owl | MD | 20.00 | United States of America | MD | 20.0 | Airsip |
| 22 | 2002-08-21 | 67.0 | Female | Tea | Night Owl | MD | 18.00 | Montgomery County | VA | 2.0 | Airsip |
| 23 | 2003-08-29 | 72.0 | Male | Neither | Night Owl | PA | 16.00 | USA | MN | 3.0 | Waterfall |
| 24 | 2001-04-26 | 68.0 | Male | Tea | Night Owl | MD | 21.58 | St. Mary’s | MD | 21.58 | Waterfall |
| 25 | 2001-09-18 | 64.0 | Female | Coffee | Night Owl | MD | 21.00 | Montgomery County | MD | 21.0 | Airsip |
| 26 | 2001-04-30 | 69.0 | Male | Neither | Night Owl | MD | 21.75 | Anne Arundel | MD | 21.75 | Waterfall |
| 27 | 2003-06-03 | 63.0 | Female | Tea | Night Owl | MD | 18.00 | China | MD | 18.0 | Waterfall |
| 28 | 2002-05-26 | 64.0 | Female | Coffee | Early Bird | MD | 17.50 | Montgomery | NE | 3.0 | Airsip |
| 29 | 2002-07-25 | 63.0 | Female | Coffee | Night Owl | MD | 20.00 | Montgomery | MD | 20.0 | Airsip |
| 30 | 2002-07-20 | 66.0 | Female | Tea | Early Bird | TX | 20.00 | Collin | TX | 20.0 | Waterfall |
| 31 | 2002-08-02 | 70.0 | Male | Coffee | Night Owl | NJ | 4.00 | Middlesex | MA | 7.0 | Waterfall |
| 32 | 2001-03-12 | 64.0 | Female | Neither | Early Bird | MD | 20.00 | Wicomico | MD | 20.0 | Waterfall |
| 34 | 2001-04-02 | 61.0 | Female | Tea | Night Owl | MD | 21.00 | Prince George's County | MD | 21.0 | Waterfall |
| 35 | 1985-01-15 | 69.0 | Male | Coffee | Night Owl | NJ | 29.00 | Morris | MD | 8.0 | Waterfall |
| 36 | 2003-12-29 | 68.0 | Male | Neither | Night Owl | MD | 18.80 | US | MD | 18.8 | Airsip |
| 37 | 2001-04-27 | 62.0 | Female | Coffee | Night Owl | MD | 18.00 | Wicomico | MD | 18.0 | Waterfall |
| 38 | 2002-01-15 | 71.0 | Male | Coffee | Early Bird | MD | 20.90 | Montgomery | MD | 20.9 | Airsip |
| 39 | 2001-09-20 | 72.0 | Male | Tea | Night Owl | MD | 21.00 | America | MD | 21.0 | Waterfall |
Now that those two columns are cleaned, let's take a look at the values in the "Current State of Residence" column.
print(pd.unique(data["Current State of Residence"]))
['NJ' 'MD' 'IN' 'PA' 'TX' 'VA' 'IL' 'Tx' 'WA' 'KS' 'CA' 'NV' 'ca' 'NE' 'Nv' 'NH' 'NC' 'MA' 'Md' 'ID' 'IA' 'MT' 'Id' 'OK' 'Ak' 'UT' 'AK' 'WY' 'AZ' 'ME' 'DC' 'WI' 'OH' 'WV' 'MI' 'TN' 'FL' 'AL' 'NM' 'GA' 'DE' 'NY' 'MO' 'VT' 'MS' 'SC' 'HI' 'ND' 'OR' 'KY' 'RI' 'SD' 'CT' 'AR' 'CO' 'LA' 'MN']
Since the state abbreviations are not all capitalized, capitalize them.
data["Current State of Residence"] = data["Current State of Residence"].apply(lambda x: x.upper())
# Now they are all upper case.
print(pd.unique(data["Current State of Residence"]))
['NJ' 'MD' 'IN' 'PA' 'TX' 'VA' 'IL' 'WA' 'KS' 'CA' 'NV' 'NE' 'NH' 'NC' 'MA' 'ID' 'IA' 'MT' 'OK' 'AK' 'UT' 'WY' 'AZ' 'ME' 'DC' 'WI' 'OH' 'WV' 'MI' 'TN' 'FL' 'AL' 'NM' 'GA' 'DE' 'NY' 'MO' 'VT' 'MS' 'SC' 'HI' 'ND' 'OR' 'KY' 'RI' 'SD' 'CT' 'AR' 'CO' 'LA' 'MN']
Now, let's take a look at the values in the "Action" column.
print(pd.unique(data["Action"]))
['Waterfall' 'Airsip' 'gluck gluck 3000' 'No term for this' 'Birdie' 'Birdie Sip' 'Sky "Let me sky that"' 'Fountain' 'Airdrink 🙃' 'I use waterfall and airship interchangeably' 'Pourgnorgin' 'Chug' 'Sky' 'Pop' 'Airdrink']
It seems that we can simplify some of these. Fold the one-off variants into their base terms by editing those rows directly:
data.at[115,"Action"] = 'Birdie'
data.at[127,"Action"] = 'Sky'
data.at[152,"Action"] = 'Airdrink'
data.at[178,"Action"] = 'Airsip'
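As an aside, a more scalable alternative to editing rows by index is to map each free-text variant to a canonical term with a dictionary and Series.replace. A sketch on toy data; the variant-to-term mapping here is our own guess at sensible groupings, not necessarily the exact row edits made above:

```python
import pandas as pd

# Map each free-text variant to a canonical term (variants taken from the
# unique values printed earlier; extend the dict as new ones appear)
canonical = {
    'Birdie Sip': 'Birdie',
    'Sky "Let me sky that"': 'Sky',
    'Airdrink 🙃': 'Airdrink',
    'I use waterfall and airship interchangeably': 'Airsip',
}

actions = pd.Series(['Waterfall', 'Birdie Sip', 'Sky "Let me sky that"'])
cleaned = actions.replace(canonical)
# cleaned is now ['Waterfall', 'Birdie', 'Sky']
```

With the real data this would be `data["Action"] = data["Action"].replace(canonical)`, which survives reordering or re-collection of the survey rows.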
Now that our data has been cleaned up, we can start creating our choropleth map. In this section, we will learn how to use Plotly to create choropleth maps. We start by grouping everything by current state of residence and action, which gives us a frequency table of how many responses we received for each state and each word.
num_in_states = data.groupby(['Current State of Residence', 'Action']).size().reset_index()
num_in_states
| | Current State of Residence | Action | 0 |
|---|---|---|---|
| 0 | AK | Airdrink | 1 |
| 1 | AK | Fountain | 1 |
| 2 | AK | Waterfall | 20 |
| 3 | AL | Airsip | 1 |
| 4 | AL | Pop | 1 |
| ... | ... | ... | ... |
| 134 | WV | Sky | 1 |
| 135 | WV | Waterfall | 14 |
| 136 | WY | Airsip | 1 |
| 137 | WY | No term for this | 1 |
| 138 | WY | Waterfall | 9 |
139 rows × 3 columns
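Notice that the size column above prints with the default label 0. If you want, reset_index accepts a name= argument that labels the count column up front; a small sketch on toy data:

```python
import pandas as pd

# Toy data in the same shape as the survey frame
toy = pd.DataFrame({
    "State":  ['MD', 'MD', 'NJ', 'MD'],
    "Action": ['Airsip', 'Airsip', 'Waterfall', 'Waterfall'],
})

# name='Count' labels the size column directly, instead of the default 0
counts = toy.groupby(['State', 'Action']).size().reset_index(name='Count')
print(counts)
```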
We will build a choropleth map to visualize the frequency of each of these action terms, using the plotly.express module.
First, import the module as shown below
import plotly.express as px
To demonstrate the process, I will show how to do this for one of the possible actions. Let's filter our dataset num_in_states down to just the rows for "Airdrink". For readability, I renamed the columns. Finally, we compute the total number of responses for that action by calling .sum() on the 'Count' column.
airdrink = num_in_states[num_in_states["Action"] == "Airdrink"]
airdrink.columns = ['Residence', 'Action', 'Count']
summ = airdrink['Count'].sum()
airdrink
| | Residence | Action | Count |
|---|---|---|---|
| 0 | AK | Airdrink | 1 |
| 9 | CA | Airdrink | 1 |
| 13 | CO | Airdrink | 1 |
| 15 | CT | Airdrink | 1 |
| 52 | MD | Airdrink | 1 |
| 72 | MS | Airdrink | 1 |
| 86 | NJ | Airdrink | 3 |
| 92 | NV | Airdrink | 1 |
| 102 | OK | Airdrink | 1 |
| 105 | OR | Airdrink | 1 |
| 129 | WA | Airdrink | 5 |
| 133 | WV | Airdrink | 1 |
Now, we will create a choropleth map. Call px.choropleth and set the fields shown below. Then change the layout of the figure to make it readable and show the figure. Make sure to label your figure with a unique title!
fig = px.choropleth(airdrink, locations="Residence",
locationmode="USA-states",
scope="usa",
color="Count",
range_color=(0,80),
color_continuous_scale="agsunset_r")
fig.update_layout(
title_text = 'Frequency of Airdrink Users in the United States',
title_font_size = 22,
title_font_color="black",
title_x=0.45,
)
fig.show()
Let's repeat this for all of the remaining options, using a loop to apply the same code from above to each action term.
groupbyaction = data.groupby(['Action']).size().reset_index()
actiondata = np.array(groupbyaction["Action"])
names = np.unique(actiondata)
for name in names:
    # Skip single-response terms; "Airdrink" was already plotted above
    if name == "Chug":
        continue
    elif name == "Pourgnorgin":
        continue
    elif name == "gluck gluck 3000":
        continue
    elif name == "Airdrink":
        continue
    filterbyname = num_in_states[num_in_states["Action"] == name]
    filterbyname.columns = ['Residence', 'Action', 'Count']
    summ = filterbyname['Count'].sum()
    fig = px.choropleth(filterbyname, locations="Residence",
                        locationmode="USA-states",
                        scope="usa",
                        range_color=(0, 80),
                        color="Count",
                        color_continuous_scale="agsunset_r")
    fig.update_layout(
        title_text = f'Frequency of {name} Users in the United States',
        title_font_size = 22,
        title_font_color = "black",
        title_x = 0.45,
    )
    fig.show()
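The chain of elif/continue branches in the loop can be condensed into a single membership test against a set. A minimal, self-contained sketch (the names list here is a toy stand-in, not the real data):

```python
# Terms to skip: single-response terms plus "Airdrink", which was plotted separately
SKIP = {"Chug", "Pourgnorgin", "gluck gluck 3000", "Airdrink"}

names = ["Airdrink", "Airsip", "Chug", "Sky", "Waterfall"]  # toy list for illustration

# One membership test replaces the four elif/continue branches
plotted = [name for name in names if name not in SKIP]
print(plotted)  # ['Airsip', 'Sky', 'Waterfall']
```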
Analysis of Choropleth Maps
Looking at the graphs, it seems "Waterfall" is the most prevalent term, with each state having at least one response. Additionally, a good majority of states had responses where people did not have a term for the action. Interestingly, the southern states below Oklahoma and Tennessee had no responses for "No term for this". Massachusetts has a considerable number of responses for "Sky", more than its responses for "Waterfall", which is a possible indicator of an isogloss boundary. Similarly, Pennsylvania has around 5 responses for "Pop", though not as many as its "Waterfall" count. Overall, the majority of responses are "Waterfall", with the exception of Maryland, which has a large number of "Airsip" users. Based on the current data, there is reason to believe that "Airsip" originates from the Maryland area. One other thing worth noting is the complete lack of "Airsip" responses on the West Coast. This could be another isogloss, marking a difference between the East and West Coasts.
Note: In the for loop, we skip values like "Chug", "Pourgnorgin" and "gluck gluck 3000" since there was only one response for each. We also skip "Airdrink" since we already plotted it before the loop.
We also flipped the color scale for the color_continuous_scale field by adding a _r at the end of the color scheme name. This makes lower frequencies lighter and higher frequencies darker, so the higher frequencies stand out more. We also kept the color range consistent across the graphs so that similar colors correspond to similar frequencies.
Since we conducted our own survey, and a majority of it was administered in Maryland, the sample is skewed, which is problematic when interpreting the results on a country-wide scale. If we were to do this again, we would collect the data in a leaner format without the extra questions, keeping just the state and the word. This would be easier on Google Forms respondents, many of whom probably did not want to take a long survey.
Let's do some data analysis now! We want to see which of the attributes has the strongest correlation to the Action Terms. We will show how the Action Terms vary for each of the attributes such as Sex, Age and Drink of Preference.
We want to compare the Action Term and Sex now. With this new dataframe below, we can see the counts of how many people in each sex define the action. There were two rows with the Sex "Nonbinary" but they were formatted differently, so we combined them and dropped that second instance. We also ended up dropping the row with the Sex as "yes please" because that is not a valid option that we are considering for this project.
# Making the new df
gen_act = data.groupby(['Sex', 'Action']).size().reset_index(name='Counts')
# Merge the two differently formatted "Nonbinary" rows into a single count
gen_act.at[19,'Counts'] = 2
# Drop the duplicate "Nonbinary" row and the "yes please" row
gen_act = gen_act.drop([20,21])
gen_act
| | Sex | Action | Counts |
|---|---|---|---|
| 0 | Female | Airdrink | 9 |
| 1 | Female | Airsip | 60 |
| 2 | Female | Birdie | 12 |
| 3 | Female | Fountain | 7 |
| 4 | Female | No term for this | 18 |
| 5 | Female | Pop | 3 |
| 6 | Female | Sky | 9 |
| 7 | Female | Waterfall | 407 |
| 8 | Male | Airdrink | 9 |
| 9 | Male | Airsip | 39 |
| 10 | Male | Birdie | 5 |
| 11 | Male | Chug | 1 |
| 12 | Male | Fountain | 3 |
| 13 | Male | No term for this | 20 |
| 14 | Male | Pop | 10 |
| 15 | Male | Pourgnorgin | 1 |
| 16 | Male | Sky | 4 |
| 17 | Male | Waterfall | 375 |
| 18 | Male | gluck gluck 3000 | 1 |
| 19 | Nonbinary | Waterfall | 2 |
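As an aside, rather than patching counts at hard-coded row indices, one could normalize the Sex labels before grouping, so differently formatted entries collapse into one category automatically. A minimal sketch on hypothetical toy values (not the actual survey responses):

```python
import pandas as pd

# Toy responses with inconsistently formatted labels (hypothetical values)
sex = pd.Series(['Female', 'male', ' Nonbinary', 'non-binary', 'Male'])

# Strip whitespace, lowercase, map spelling variants onto one canonical
# label, then restore title case
normalized = (sex.str.strip()
                 .str.lower()
                 .replace({'non-binary': 'nonbinary'})
                 .str.capitalize())
print(normalized.tolist())  # ['Female', 'Male', 'Nonbinary', 'Nonbinary', 'Male']
```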
Now, let's plot the data into a Bar Plot.
# Making the barplot and formatting the graph so it is readable
plt.figure(figsize=(16,8))
sns.barplot(data=gen_act, x="Action", y="Counts", hue="Sex")
plt.title("Action Term among All Given Sexes from Multiple States and Ages", fontsize = 12)
plt.xlabel("Action Term", fontsize = 12)
plt.ylabel("Number of People", fontsize = 12)
Text(0, 0.5, 'Number of People')
For this attribute (Sex), we decided to use a bar plot instead of a violin plot for a couple of reasons. First, Sex here is a categorical variable with three values (Female, Male, and Nonbinary), so counts per category are best represented by a bar plot. We can see that the term "Waterfall" is the most popular among both males and females. From this plot, we can see that for "Airsip", "Waterfall", "Birdie", "Fountain", and "Sky", females had the highest count, while for "No term for this", "Pop", "Chug", "Pourgnorgin", and "gluck gluck 3000", males had the highest count. For "Airdrink", males and females are tied. The Nonbinary respondents had two entries, both under "Waterfall".
Now, let's compare the Action Term against Age. The first thing we need to do is calculate the ages of the respondents, because we do not have a column for that yet. We can derive it from the Birthday field, which we required them to input. We wanted to create a new dataframe to look at the age distribution before comparing it against the Action Term. Below, we made this dataframe and transposed it so the ages become the columns.
## Calculating the Age from Birthdays and adding it to a new column in df
from datetime import datetime  # needed for datetime.today()

today = datetime.today()
# Subtract the birth year, then subtract 1 more if this year's birthday hasn't happened yet
data['Age'] = data['Birthday'].apply(lambda x: today.year - x.year - ((today.month, today.day) < (x.month, x.day)))
ages = data.groupby(['Age']).size().reset_index(name='Count')
# ^ This will get the number of times each Age occurs in our data
ages = ages.transpose()
pd.set_option('display.max_columns', None)
ages.columns = ages.iloc[0]
ages = ages[1:]
ages
| Age | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 51 | 53 | 63 | 67 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 1 | 2 | 15 | 16 | 72 | 91 | 119 | 121 | 78 | 81 | 75 | 54 | 12 | 12 | 12 | 11 | 20 | 18 | 11 | 10 | 13 | 13 | 9 | 7 | 16 | 15 | 7 | 13 | 6 | 10 | 5 | 17 | 14 | 15 | 1 | 1 | 1 | 1 | 1 |
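The tuple comparison in the age formula above is worth unpacking: comparing (month, day) pairs yields True (which Python treats as 1) exactly when this year's birthday is still ahead. A self-contained sketch using the standard library, with fixed dates so the result is deterministic:

```python
from datetime import date

def age_on(today, birthday):
    # (month, day) tuple comparison is True (1) when the birthday
    # hasn't happened yet this year, so we subtract one extra year
    return today.year - birthday.year - ((today.month, today.day) < (birthday.month, birthday.day))

today = date(2023, 5, 1)
print(age_on(today, date(2001, 3, 15)))  # 22 (birthday already passed this year)
print(age_on(today, date(2001, 9, 20)))  # 21 (birthday still ahead this year)
```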
# displot creates its own figure, so a separate plt.figure() call is unnecessary
sns.displot(data, x="Age", bins = 50)
<seaborn.axisgrid.FacetGrid at 0x7f818da59e80>
Just taking a look at our data, we see that most of the respondents are aged 18-25. We should keep this in mind when we are looking at the violin plots for each term. We expect to see unimodal or bimodal violins with the highest densities for each term around ages 18-25.
# Plotting to compare terms and the ages where each is most common
plt.figure(figsize=(16,8))
sns.violinplot(data=data, x="Action", y="Age")
plt.title("Violin Plot for Action Term Across Ages", fontsize = 12)
plt.xlabel("Action Term", fontsize = 12)
plt.ylabel("Age (in Years)", fontsize = 12)
Text(0, 0.5, 'Age (in Years)')
From this violin plot, we can see that the term "Waterfall" is unimodal, used mostly among people around the age of 20, which is where the median lies. "Waterfall" is also used somewhat evenly across ages 30-48. The term "Airsip" is also unimodal and used very heavily by people around the age of 20; its distribution has little spread across ages. We see that the term "gluck gluck 3000" appears on the x-axis but has no visible violin, because only one person responded with it. For the response "No term for this", the ages range all the way up to the late 60s. It looks unimodal, peaking between 20 and 30 years old, but the peak is not as prominent as for "Waterfall" or "Airsip"; the median for this term lies around 25 years old. The term "Birdie" is bimodal, with the higher density around ages 20-30 and a second peak around 40. The term "Sky" is also bimodal.
# We need a new dataframe for the matrix; .copy() keeps the original data intact
df = data.copy()
# Turning the types into numerical values so the .corr() function can do the calculations
df['Current State'] = data['Current State of Residence'].astype('category').cat.codes
df['Gender'] = data['Sex'].astype('category').cat.codes
df['Drink'] = data['Coffee/Tea'].astype('category').cat.codes
df['Early/Night'] = data['Early Bird/Night Owl'].astype('category').cat.codes
df['Longest State'] = data['State of Longest Residence'].astype('category').cat.codes
df['Years in Longest State'] = df['Years in Longest State'].astype(int)
df['Term'] = data['Action'].astype('category').cat.codes
df = df.drop(columns = ['Current State of Residence', 'Action', 'Sex', 'Coffee/Tea', 'Early Bird/Night Owl', 'State of Longest Residence'])
# Making the correlation matrix here
correlation_matrix = df.corr()
correlation_matrix
| | Height(in) | Years in Current State | Years in Longest State | Age | Current State | Gender | Drink | Early/Night | Longest State | Term |
|---|---|---|---|---|---|---|---|---|---|---|
| Height(in) | 1.000000 | 0.031351 | 0.025715 | 0.006320 | 0.047571 | 0.268460 | -0.039771 | 0.024336 | 0.053489 | 0.031867 |
| Years in Current State | 0.031351 | 1.000000 | 0.679708 | 0.290059 | -0.017315 | 0.012880 | -0.017395 | 0.024233 | 0.007441 | -0.053148 |
| Years in Longest State | 0.025715 | 0.679708 | 1.000000 | 0.501793 | 0.018223 | 0.007497 | -0.036944 | -0.056191 | -0.017370 | 0.005489 |
| Age | 0.006320 | 0.290059 | 0.501793 | 1.000000 | 0.021232 | 0.081062 | -0.055607 | -0.076273 | 0.004848 | 0.063689 |
| Current State | 0.047571 | -0.017315 | 0.018223 | 0.021232 | 1.000000 | 0.067509 | 0.007301 | -0.054083 | 0.494520 | 0.059341 |
| Gender | 0.268460 | 0.012880 | 0.007497 | 0.081062 | 0.067509 | 1.000000 | -0.012358 | 0.034360 | 0.053996 | 0.032041 |
| Drink | -0.039771 | -0.017395 | -0.036944 | -0.055607 | 0.007301 | -0.012358 | 1.000000 | 0.018147 | 0.038709 | 0.006894 |
| Early/Night | 0.024336 | 0.024233 | -0.056191 | -0.076273 | -0.054083 | 0.034360 | 0.018147 | 1.000000 | 0.017961 | -0.014797 |
| Longest State | 0.053489 | 0.007441 | -0.017370 | 0.004848 | 0.494520 | 0.053996 | 0.038709 | 0.017961 | 1.000000 | 0.061732 |
| Term | 0.031867 | -0.053148 | 0.005489 | 0.063689 | 0.059341 | 0.032041 | 0.006894 | -0.014797 | 0.061732 | 1.000000 |
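One caveat worth keeping in mind when reading the matrix above: .cat.codes assigns each category an integer based on its sorted (alphabetical) position, so the ordering of the codes is arbitrary for nominal variables like Action or State, and Pearson correlations on these codes should be read as rough indicators at best. A small sketch of how the codes are assigned:

```python
import pandas as pd

# Toy series; categories are coded in sorted (alphabetical) order:
# Airsip -> 0, Sky -> 1, Waterfall -> 2
terms = pd.Series(['Waterfall', 'Airsip', 'Sky', 'Waterfall'])
codes = terms.astype('category').cat.codes
print(codes.tolist())  # [2, 0, 1, 2]
```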
Now, let's turn this into a heatmap using the Seaborn Heatmap function.
cmap = sns.cm.rocket_r
plt.figure(figsize=(16,8))
sns.heatmap(correlation_matrix, annot=True, cmap=cmap)
<AxesSubplot:>
In creating this heatmap, it originally displayed it with the weakest correlation as the darkest colors and the strongest with the lightest colors. This is not how we wanted to display our heatmap. We want to display the strongest correlation as the darkest color because we want to emphasize these cells the most. So to fix this issue, we flipped the color scale.
Now, looking at this correlation heatmap, we want to look at the far right column or the last row, because we want to see how each attribute correlates with the Action Term. The Term-Term cell should be the darkest, since any attribute correlates perfectly with itself; this pattern runs down the diagonal of the matrix. Within the Term column, we then look for the darkest cells aside from Term itself.
At first glance, not looking at the values, the next darkest cells are for Longest State, Age, and Current State. As a group, we were expecting the strongest correlation to be with either Current State or Longest State. But from this heatmap, we can see that the strongest attribute is Age at about 0.064, greater than Longest State by roughly 0.002. Note that all of these correlations are quite weak.
After comparing the choropleth maps and running an analysis on the data, our results ultimately only describe our own dataset. We did learn one thing from this project: Waterfall wins. While this data cannot support conclusions about the rest of the United States, it can help guide future study of this topic if we were to redo this experiment (and we probably will, for fun).
In all seriousness, based on the data, it seems that we can create an isogloss for "Sky", "No Term for This" and "Airsip". We would create an isogloss for Sky that would encompass Massachusetts. As for "No Term for This", we would draw a line around Idaho and Iowa. Finally, we would create an isogloss for "Airsip" that encompasses the DMV area (DC, Maryland, Virginia).
If we were to do this experiment again, we would send out a survey that asks only a few essential questions, such as the respondent's state of residence and the term they use for this action.
We would not include the filler questions, since they caused a lot of controversy on Reddit over data privacy and feasibility: the form was far too long, which deterred people from taking the survey.
Isoglosses are used in linguistics to study the distribution of linguistic features within a language or a group of related languages. Analyzing isoglosses can provide insight into the history and evolution of a language or dialect, as well as the social, cultural, and geographic factors that have influenced its development. Isoglosses can also be used to identify and describe different dialects within a language, and to understand how these dialects differ from each other and from the standard form of the language. By analyzing isoglosses, linguists can gain a better understanding of the diversity and complexity of language, and how it is shaped by the people who use it.
Understanding dialect differences can help us better help individuals based on their area of residence. This is really useful for things like Natural Language Processing (NLP). In NLP, isoglosses can be used to identify and classify different language varieties or dialects, which can be useful in a variety of applications. For example, in language identification tasks, isoglosses can be used to distinguish between different language varieties or dialects and classify text accordingly. In language translation, isoglosses can be used to identify regional variations and ensure that the translation accurately reflects the intended dialect or language variety. In text classification tasks, isoglosses can be used to identify and classify text by regional dialect or language variety, which can be useful for tasks such as sentiment analysis or opinion mining. Isoglosses can also be used to improve the performance of NLP models by providing additional context and information about the language being used.
Thank you for reading through this entire project! We hope you enjoyed :)